Template 6:

One continuous & one categorical independent variable (line graph)

Introduction

Welcome! This template will guide you through a Bayesian analysis in R, even if you have never done Bayesian analysis before. There are a set of templates, each for a different type of analysis. This template is for data with one continuous and one categorical independent variable and will produce a line chart. If your analysis includes a linear regression, this might be the right template for you.

This template assumes you have basic familiarity with R. Once complete, this template will produce a summary of the analysis, complete with parameter estimates and credible intervals, and two animated HOPs (see Hullman, Resnick, Adar 2015 DOI: 10.1371/journal.pone.0142444 and Kale, Nguyen, Kay, and Hullman VIS 2018 for more information) for both your prior and posterior estimates.

This Bayesian analysis focuses on producing results in a form that are easily interpretable, even to nonexperts. The credible intervals produced by Bayesian analysis are the analogue of confidence intervals in traditional null hypothesis significance testing (NHST). A weakness of NHST confidence intervals is that they are easily misinterpreted [sources for all of this]. Many people naturally interpret an NHST 95% confidence interval to mean that there is a 95% chance that the true parameter value lies somewhere in that interval; in fact, it means that if the experiment were repeated 100 times, 95 of the resulting confidence intervals would include the true parameter value. The Bayesian credible interval sidesteps this complication by providing the intuitive meaning: a 95% chance that the true parameter value lies somewhere in that interval. To further support intuitive interpretations of your results, this template also produces animated HOPs plots, a type of plot that is more effective than visualizations such as error bars in helping people make accurate judgments about probability distributions.

This set of templates supports a few types of statistical analysis. (In future work, this list of supported statistical analyses will be expanded.) For clarity, each type has been broken out into a separate template, so be sure to select the right template before you start! A productive way to choose which template to use is to think about what type of chart you would like to produce to summarize your data. Currently, the templates support the following:

One independent variable:

  1. Categorical; bar graph (e.g. t-tests, one-way ANOVA)

  2. Ordinal; line graph (e.g. t-tests, one-way ANOVA)

  3. Continuous; line graph (e.g. linear regression)

Two independent variables:

  1. Two categorical; bar graph (e.g. two-way ANOVA)

  2. One categorical, one ordinal; line graph (e.g. two-way ANOVA)

  3. One categorical, one continuous; line graph (e.g. linear regression with multiple lines)

Note that this template fits your data to a model that assumes normally distributed error terms. (This is the same assumption underlying t-tests, ANOVA, etc.) This template requires you to have already run diagnostics to determine that your data is consistent with this assumption; if you have not, the results may not be valid.

Once you have selected your template, to complete the analysis, please follow along this template. For each code chunk, you may need to make changes to customize the code for your own analysis. In those places, the code chunk will be preceded by a list of things you need to change (with the heading “What to change”), and each line that needs to be customized will also include the comment #CHANGE ME within the code chunk itself. You can run each code chunk independently during debugging; when you’re finished, you can knit the document to produce the complete document.

Good luck!

Sample dataset

This template comes prefilled with an example dataset from Moser et al. (DOI: 10.1145/3025453.3025778), which examines choice overload in the context of e-commerce. The study examined the relationship between choice satisfaction (measured at a 7-point Likert scale), the number of product choices presented on a webpage, and whether the participant is a decision “maximizer” (a person who examines all options and tries to choose the best) or a “satisficer” (a person who selects the first option that is satisfactory). In this template, we analyze the relationship between choice set size, which we treat as a continuous variable in this template with values that can fall in the range [12,72]; type of decision-making (maximizer or satisficer), a two-level categorical variable; and choice satisfaction, which we treat as a continuous variable with values that can fall in the range [1,7].

Set up

Libraries

If this is your first time using the template, you may need to install libraries. Uncomment the lines below - install.packages() and devtools::install_github() - to install the required packages. This only needs to be done once.

knitr::opts_chunk$set(fig.align="center") 

# install.packages("rstanarm", "tidyverse", "tidybayes", "modelr", "devtools")
# devtools::install_github("thomasp85/gganimate")

library(rstanarm) #bayesian analysis package
library(tidyverse) #tidy datascience commands
library(tidybayes) #tidy data + ggplot workflow
library(modelr) #tidy pipelines when modelling
library(ggplot2) #plotting package
library(gganimate) #animate ggplots

theme_set(theme_light()) # set the ggplot theme for all plots 

Read in data

What to change

  1. mydata: Read in your data.
mydata = read.csv('datasets/choc_cleaned_data.csv') #CHANGE ME 1

Specify model

We’ll fit the following model: stan_glm(y ~ x), which specifies a linear regression where each \(y_i\) is drawn from a normal distribution with mean equal to \(a + bx_i\) and standard deviation equal to sigma (\(\sigma\)):

\[ y_i \sim Normal(a + bx_i, \sigma) \]

Choose your independent and dependent variables. These are the variables that will correspond to the x and y axis on the final plots.

What to change

  1. mydata$x1: Select which variables will appear on the x-axis of your plots.

  2. mydata$x2: Select the second independent variable that will group your data; think of this as the group aesthetic in ggplot.

  3. mydata$y: Select which variables will appear on the y-axis of your plots.

  4. x_lab: Label your plots’ x-axes.

  5. y_lab: Label your plots’ y-axes.

#select your independent and dependent variables
mydata$x1 = mydata$num_products_displayed #CHANGE ME 2
mydata$x2 = mydata$sat_max #CHANGE ME 3
mydata$y = mydata$satis_Q1 #CHANGE ME 4

# label the axes on the plots
x_lab = "Choices" #CHANGE ME 5
y_lab = "Satisfaction" #CHANGE ME 6

Set priors

In this section, you will set priors for your model. Setting priors thoughtfully is important to any Bayesian analysis, especially if you have a small sample of data that you will use for fitting for your model. The priors express your best prior belief, before seeing any data, of reasonable values for the model parameters. The model estimation process produces a posterior distribution: beliefs about the plausible values for parameters, given the data in your dataset.

Ideally, you will have previous literature from which to draw these prior beliefs. If no previous studies exist, you can instead assign “weakly informative priors” that only minimally restrict the model; for example, a weakly informative prior for a parameter that can only have values between 1 and 7 would assign a very small probability to values outside of that range. We have provided an example of how to set priors below.

To check the plausibility of your priors, use the code section after this one to generate a graph of 100 plausible fit lines from your priors to check if the values generated are reasonable.

Our model has the following parameters: a. the y-value at the intercept b. the slope of the line c. the standard deviation of the normally distributed error term

To simplify things, we only let you specify beliefs about these parameters in the form of a normal distribution. Thus, you will specify what you think is the most likely value for the parameter (the mean), and a standard deviation. You will be expressing a belief that you were 95% certain (before looking at any data) that the true value of the parameter is within two standard deviations of the mean.

Finally, our modeling system, stan_glm(), will automatically set priors for the last parameter, the standard deviation of the normally distributed error term for the model overall. That is almost never a model parameter that you will be interpreting, and STAN does a reasonable job of assigning weakly informative priors for that parameter in a way that it won’t have an impact on the estimation of other parameters, which we normally do care about.

What to change

  1. a_prior: Select the intercept mean.

  2. a_sd: Select the intercept standard deviation.

  3. b1_prior: Select the slope mean.

  4. b1_sd: Select the slope standard deviation.

  5. You should also change the comments in the code below to explain your choice of priors.

# CHANGE THIS COMMENT EXPLAINING YOUR CHOICE OF PRIORS (11)
# In our example dataset, y-axis scores can be in the range [1, 7].
# Thus, a mean value in the control condition of less than 1 or greater
# than 7 is impossible. With a normal distribution, we can't completely
# rule out those impossible values, but we choose a mean and sd that assign
# less than 5% probability to those impossible values.

# We select the mean of the range [1,7], and an sd that assigns a 
# 95% probability to values that vary up to +3/-3 from this mean.

a_prior = 4  # CHANGE ME 7
a_sd = 1.5   # CHANGE ME 8

# CHANGE THIS COMMENT EXPLAINING YOUR CHOICE OF PRIORS (11)
# In this example, we will say we do not have a strong hypothesis about the effect
# of choice set size on satisfaction, so we set the mean of the effect size 
# parameters to be 0. In the absence of other information, we set the sd so that 
# a change from the minimum choice set size (12) to the maximum choice set size (72)
# could plausibly result in a +6/-6 change in satisfaction, the maximum possible change.

b1_prior = 0  # CHANGE ME 9
b1_sd = (6/(72-12))/2   # CHANGE ME 10

Check priors

Next, you’ll want to check your priors by running this code chunk. It will produce a set of 100 sample draws drawn from the priors you set in the previous section, so you can check to see if the values generated are reasonable. (We’ll go into the details of this code later.)

What to change

Nothing! Just run this code to check your priors, adjusting prior values above as needed until you find reasonable prior values. Note that you may get a couple of very implausible or even impossible values because our assumption of normally distributed priors assigns a small probability to even very extreme values. If you are concerned by the outcome, you can try rerunning it a few more times to make sure that any implausible values you see don’t come up very often.

#generate sample draws from the priors
m_prior = stan_glm(y ~ x1*x2, data = mydata,
  prior_intercept = normal(a_prior, a_sd, autoscale = FALSE),
  prior = normal(b1_prior, b1_sd, autoscale = FALSE),
  prior_PD = TRUE
)

#create the dataframe for fitted draws & create the plot
mydata %>% 
  data_grid(x1, x2) %>%
  add_fitted_draws(m_prior, n = 100, seed = 12345) %>%
  ggplot(aes(x = x1, y = .value, col = x2, group = x2)) +
    geom_line(aes(group = .draw), alpha = .2) +
    facet_grid(cols = vars(x2)) +
  # coord_cartesian(ylim = c(min(mydata$y, na.rm=T), max(mydata$y, na.rm=T))) + # sets axis limits - CHANGE ME (optional)
    theme(plot.title = element_text(hjust = 0.5),
        legend.position="none") +
    labs(x=x_lab, y=y_lab) + # axes labels 
    ggtitle("100 sample draws from the priors") 

Run the model

There’s nothing you have to change here. Just run the model.

m = stan_glm(y ~ x1*x2, data = mydata,
  prior_intercept = normal(a_prior, a_sd, autoscale = FALSE),
  prior = normal(b1_prior, b1_sd, autoscale = FALSE)
)

Model summary

Here is a summary of the model fit:

summary(m, digits=3)
## 
## Model Info:
## 
##  function:     stan_glm
##  family:       gaussian [identity]
##  formula:      y ~ x1 * x2
##  algorithm:    sampling
##  priors:       see help('prior_summary')
##  sample:       4000 (posterior sample size)
##  observations: 611
##  predictors:   4
## 
## Estimates:
##                   mean     sd       2.5%     25%      50%      75%   
## (Intercept)        6.168    0.098    5.975    6.100    6.169    6.232
## x1                -0.002    0.002   -0.006   -0.004   -0.002   -0.001
## x2satisficer       0.009    0.049   -0.084   -0.024    0.010    0.043
## x1:x2satisficer    0.004    0.002    0.000    0.002    0.004    0.005
## sigma              1.024    0.030    0.967    1.003    1.024    1.044
## mean_PPD           6.133    0.059    6.021    6.093    6.132    6.174
## log-posterior   -887.737    1.529 -891.375 -888.570 -887.442 -886.580
##                   97.5% 
## (Intercept)        6.362
## x1                 0.002
## x2satisficer       0.104
## x1:x2satisficer    0.008
## sigma              1.083
## mean_PPD           6.246
## log-posterior   -885.684
## 
## Diagnostics:
##                 mcse  Rhat  n_eff
## (Intercept)     0.002 1.000 4000 
## x1              0.000 1.000 4000 
## x2satisficer    0.001 0.999 4000 
## x1:x2satisficer 0.000 0.999 3876 
## sigma           0.000 0.999 4000 
## mean_PPD        0.001 1.000 4000 
## log-posterior   0.036 1.001 1827 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Visualizing results

Given this model, we might want to plot the fit line with credible bands around it. To do that, we will first construct a fit grid: a data frame of points at which we want to calculate the value of the fit line from the model. The data_grid function allows us to do this easily, e.g. by asking for 20 equally spaced points along the value of the x variable in our original data:

mydata %>% 
  data_grid(x1 = seq_range(x1, n = 20), x2)
## # A tibble: 40 x 2
##       x1 x2        
##    <dbl> <fct>     
##  1  12   maximizer 
##  2  12   satisficer
##  3  15.2 maximizer 
##  4  15.2 satisficer
##  5  18.3 maximizer 
##  6  18.3 satisficer
##  7  21.5 maximizer 
##  8  21.5 satisficer
##  9  24.6 maximizer 
## 10  24.6 satisficer
## # ... with 30 more rows

Given this grid, we can then draw samples from the posterior mean evaluated at each x position in the grid using the add_fitted_draws function, and then summarize these samples in ggplot using a stat_lineribbon. For readability, we’ll separate the \(x2\) levels into facets:

mydata %>% 
  data_grid(x1 = seq_range(x1, n = 20), x2) %>%
  add_fitted_draws(m) %>%
  ggplot(aes(x = x1, y = .value, group = x2)) +
  stat_lineribbon() +
  coord_cartesian(ylim = c(min(mydata$y, na.rm=T), max(mydata$y, na.rm=T))) + # sets axis limits - CHANGE ME (optional)
  scale_fill_brewer() + 
  facet_grid(cols = vars(x2))

But what we really want is to display a selection of plausible fit lines, say 100 of them. To do that, we instead ask add_fitted_draws for only 100 draws, which we plot separately as lines:

mydata %>% 
  data_grid(x1 = seq_range(x1, n = 101), x2) %>%
  # the seed argument is for reproducibility: it ensures the pseudo-random
  # number generator used to pick draws has the same seed on every run,
  # so that someone else can re-run this code and verify their output matches
  add_fitted_draws(m, n = 100, seed = 12345) %>%
  ggplot(aes(x = x1, y = .value, col=x2, group=x2)) +
    coord_cartesian(ylim = c(min(mydata$y, na.rm=T), max(mydata$y, na.rm=T))) + # sets axis limits - CHANGE ME (optional)
    geom_line(aes(group = .draw), alpha = .2) +
    facet_grid(cols = vars(x2)) +
    theme(plot.title = element_text(hjust = 0.5),
        legend.position="none") +
    labs(x=x_lab, y=y_lab) # axes labels 

Even better would be to animate this graph using HOPs (Hypothetical Outcomes Plot), a type of plot that visualizes uncertainty as sets of draws from a distribution, which has been demonstrated to improve multivariate probability estimates (Hullman et al. 2015) and increase sensitivity to the underlying trend in data (Kale et al. 2018) over static representations of uncertainty like error bars.

To set up the HOPs plots, we will first set the aesthetics for the ggplot that we will use.

Plot aesthetics

You can set the aesthetics of your HOPs plots here.

What to change

In most cases, the default values here should be just fine. If you want to adjust the aesthetics of the animated plots later, you can do so here using the ggplot2 package; just be sure to keep the lines that are commented with “do not change.” Below are two optional code customizations that we think may be particularly useful for some datasets.

  1. [Optional] coord_cartesian(ylim = …): You may want to manually set the y-axis limits. If so, uncomment this line in the code below and set your preferred limits accordingly.

  2. [Optional] scale_x_discrete(limits = …): You may want to manually set the order of the x-axis levels; for example, if you have levels “before” and “after,” ggplot defaults often plot “after” on the left and “before” on the right. If so, uncomment this line in the code below and set your preferred level order. The names of the levels must match what is in your dataset.

# the default code for the plots - if needed, the animated plot aesthetics can be customized here
graph_plot = function(data) {
  ggplot(data, aes(x = x1, y = .value, col=x2, group=x2)) + #do not change
    geom_line() + #do not change
    transition_states(.draw, transition_length = 1, state_length = 1) + # gganimate code to animate the plots. Do not change
    coord_cartesian(ylim = c(min(mydata$y, na.rm=T), max(mydata$y, na.rm=T))) + # sets axis limits - CHANGE ME 11 (optional)
    labs(x=x_lab, y=y_lab) # axes labels
}

# Animation parameters
n_draws = 100 # the number of draws to visualize in the HOPs plots
frames_per_second = 2.5 # the speed of the HOPs
# 2.5 frames per second (400ms) is the recommended speed for the HOPs visualization.
# Faster speeds (100ms) have been demonstrated to not work as well.
# See Kale et al. VIS 2018 for more info.

Now that the plot aesthetics are set, we can return to our fit grid and repeatedly draw samples from the posterior mean evaluated at each x position in the grid using the add_fitted_draws function. Each frame of the animation shows a different draw from the posterior:

p = mydata %>% 
  data_grid(x1 = seq_range(x1, n = 101), x2) %>%
  add_fitted_draws(m, n = n_draws, seed = 12345) 
  # the seed argument is for reproducibility: it ensures the pseudo-random
  # number generator used to pick draws has the same seed on every run,
  # so that someone else can re-run this code and verify their output matches

animate(graph_plot(p), nframes = n_draws * 2, fps = frames_per_second)

For more context, we could also show the fit lines with the data:

animate(graph_plot(p) + 
          geom_count(aes(y = y), data = mydata, 
                     position=position_dodge(5)), 
        nframes = n_draws * 2, fps = frames_per_second)

Sampling from the prior

We already looked at some sample plots of the priors when we were setting priors; now we want to look at these priors again, but in a HOPs format so we can compare to the posterior plots. To get the prior plots, we can simply ask stan_glm to sample from the prior.

What to change

If you are knitting this document, or if you already ran the code in the “Check priors” section that calculates m_prior, you can comment out this line:

#m_prior = update(m, prior_PD = TRUE)

Then our code to generate plots is identical, except we replace m with m_prior:

p_prior = mydata %>% 
  data_grid(x1 = seq_range(x1, n = 101), x2) %>%
  add_fitted_draws(m_prior, n = n_draws, seed = 12345)

animate(graph_plot(p_prior), nframes = n_draws * 2, fps = frames_per_second) 

Again, with context:

animate(graph_plot(p_prior) + 
          geom_count(aes(y = y), data = mydata, 
                     position=position_dodge(5)), 
        nframes = n_draws * 2, fps = frames_per_second)